ABSTRACT

Naive Bayes is a simple Bayesian classifier with strong
independence assumptions among the attributes. Despite these
assumptions, the classifier often performs well in practice.
It is believed that relaxing the independence assumptions of
a naive Bayes classifier may improve the classification
accuracy of the resulting structure. While finding an optimal
unconstrained Bayesian Network (for almost any reasonable
scoring measure) is an NP-hard problem, it is possible to
learn in polynomial time optimal networks obeying various
structural restrictions. Several authors have examined the
possibilities of adding augmenting arcs between attributes of
a Naive Bayes classifier. Friedman, Geiger and Goldszmidt
define the TAN structure in which the augmenting arcs form a
tree on the attributes, and present a polynomial time
algorithm that learns an optimal TAN with respect to MDL
score. Keogh and Pazzani define Augmented Bayes networks in
which the augmenting arcs form a forest on the attributes,
and present heuristic search methods for learning good,
though not optimal, augmenting arc sets. In this paper, we
present a simple, polynomial time greedy algorithm for
learning an optimal Augmented Bayes Network with respect to
MDL score.

Categories and Subject Descriptors 

I.5 [Computer Methodologies]: Pattern Recognition; I.5.2
[Pattern Recognition]: Classifier design and evaluation
1. INTRODUCTION

Classification is a machine learning task that requires con- 
struction of a function that classifies examples into one of a 
discrete set of possible categories. Formally, the examples 
are vectors of attribute values and the discrete categories are 
the class labels. The construction of the classifier function
is done by training on preclassified instances of a set of
attributes. This kind of learning is called supervised learning
as the learning is based on labeled data. A few of the var- 
ious approaches for supervised learning are artificial neural 
networks, decision tree learning, support vector machines 
and Bayesian networks EJ. All these methods are compa- 
rable in terms of classification accuracy. Bayesian networks 
are especially important because they provide us with useful 
information about the structure of the problem itself. 

One highly simple and effective classifier is the naive Bayes 
classifier 0- The naive Bayes classifier is based on the as- 
sumption that the attribute values are conditionally inde- 
pendent of each other given the class label. The classifier 
learns the probability of each attribute $X_i$ given the class
$C$ from the preclassified instances. Classification is done by
calculating the probability of the class $C$ given all attributes
$X_1, X_2, \ldots, X_n$. The computation of this probability is made
simple by application of Bayes rule and the rather naive as- 
sumption of attribute independence. In practical classifica- 
tion problems, we hardly come across a situation where the 
attributes are truly conditionally independent of each other. 
Yet the naive Bayes classifier performs well compared to
other state-of-the-art classifiers.
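
As a rough illustration of the classification rule described above, the following minimal sketch estimates P(C) and P(X_i | C) by counting and picks the class that maximizes their product. The function names `train_naive_bayes` and `classify`, the toy data, and the Laplace smoothing constant `alpha` are assumptions made for this example and do not come from the paper.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate P(C) and P(X_i | C) by counting preclassified instances.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    n_attrs = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[i][c][v] = number of class-c instances with X_i = v
    value_counts = [defaultdict(Counter) for _ in range(n_attrs)]
    domains = [set() for _ in range(n_attrs)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[i][c][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains, alpha


def classify(model, row):
    """Return the class maximizing log P(C) + sum_i log P(x_i | C)."""
    class_counts, value_counts, domains, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, v in enumerate(row):
            num = value_counts[i][c][v] + alpha
            den = n_c + alpha * len(domains[i])
            score += math.log(num / den)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: two attributes, two classes.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(classify(model, ("rainy", "cool")))
```

Working in log space avoids numerical underflow when the number of attributes is large; the ranking of classes is unchanged.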

An obvious question that comes to mind is whether re- 
laxing the attribute independence assumption of the naive 
Bayes classifier will help improve the classification accuracy 
of Bayesian classifiers. In general, learning a structure (with 
no structural restrictions) that represents the appropriate 
attribute dependencies is an NP-Hard problem. Several au- 
thors have examined the possibilities of adding arcs (aug- 
menting arcs) between attributes of a naive Bayes classi- 
fier that obey certain structural restrictions. For instance, 
Friedman, Geiger and Goldszmidt [2] define the TAN structure,
in which the augmenting arcs form a tree on the attributes.
They present a polynomial time algorithm that learns an
optimal TAN with respect to MDL score. Keogh and Pazzani [4]
define Augmented Bayes networks in which the augmenting arcs
form a forest on the attributes (a collection of trees, hence
a relaxation of the structural restriction of TAN), and
present heuristic search methods for learning good, though
not optimal, augmenting arc sets. The authors, however,
evaluate the learned structure only in terms of observed
misclassification error and not against a scoring metric,
such as MDL. Sacha, in his dissertation (unpublished,
http://jbnc.sourceforge.net/JP_Sacha_PhD_Dissertation.pdf),
defines the same problem as Forest Augmented Naive Bayes
(FAN) and presents a polynomial time algorithm for finding
good classifiers with respect to various quality measures
(not MDL). The author, however, does not claim the learned
structure to be optimal with respect to any quality measure.

Figure 1: A simple Augmented Bayes Network

In this paper, we present a polynomial time algorithm for
finding optimal Augmented Bayes Networks/Forest Augmented
Naive Bayes with respect to MDL score. The rest of the paper
is organized as follows. In Section 2, we define the
Augmented Bayes structure. Section 3 defines the MDL score
for Bayesian Networks. The reader is referred to the
Friedman paper [5] for details on MDL score, as we present
only the necessary details in Section 3. Section 4 provides
intuition about the problem, and Sections 5 and 6 present the
polynomial time algorithm and prove that it is optimal.

2. AUGMENTED BAYES NETWORKS 

The Augmented Bayes Network (ABN) structure is defined
by Keogh and Pazzani [4] as follows:

• Every attribute Xi has the class attribute C as its 
parent. 

• An attribute Xi may have at most one other attribute 
as its parent. 

Note that the definition is similar to the TAN definition
given in [2]. The difference is that, whereas TAN necessarily
adds $n - 1$ augmenting arcs (where $n$ is the number of
attributes), ABN adds any number of augmenting arcs up to
$n - 1$. Figure 1 shows a simple ABN. The dashed arcs
represent augmenting arcs. Note that attributes 1 and 5 in the
figure do not have any incoming augmenting arcs. Thus the
ABN structure does not enforce the tree structure of TAN,
giving more model flexibility.
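
The two structural restrictions can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_abn` and the encoding of augmenting arcs as `(parent, child)` index pairs are assumptions, and the arcs from the class attribute are treated as implicit.

```python
def is_valid_abn(n_attributes, augmenting_arcs):
    """Check the ABN restrictions for arcs given as (parent, child) pairs.

    Attributes are numbered 0..n_attributes-1. The class arcs C -> X_i are
    implicit, so only the augmenting arcs are passed in.
    """
    parent = {}
    for tail, head in augmenting_arcs:
        if not (0 <= tail < n_attributes and 0 <= head < n_attributes):
            return False      # arc refers to an unknown attribute
        if tail == head or head in parent:
            return False      # self-loop or a second attribute parent
        parent[head] = tail
    # With at most one parent per node, the arcs form a forest exactly when
    # following parent pointers never loops.
    for start in parent:
        node, steps = start, 0
        while node in parent:
            node = parent[node]
            steps += 1
            if node == start or steps > n_attributes:
                return False  # a directed cycle was found
    return True


print(is_valid_abn(5, [(0, 1), (1, 2), (3, 4)]))
```

An empty arc list passes the check and corresponds to plain naive Bayes, while a valid arc list with n - 1 arcs corresponds to a TAN.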

3. BACKGROUND 

In this section we present the definitions of a Bayesian
network and its MDL score. This section is derived from the
Friedman paper [5]. We refer the reader to that paper for
more information, as we only present the necessary details.

A Bayesian network is an annotated directed acyclic graph
(DAG) that encodes a joint probability distribution of a
domain composed of a set of random variables (attributes).
Let $U = \{X_1, \ldots, X_n\}$ be a set of $n$ discrete
attributes where each attribute $X_i$ takes values from a
finite domain. Then, the Bayesian network for $U$ is the pair
$B = \langle G, \Theta \rangle$, where $G$ is a DAG whose
nodes correspond to the attributes $X_1, \ldots, X_n$ and
whose arcs represent direct dependencies between the
attributes. The graph structure $G$ encodes the following set
of independence assumptions: each node $X_i$ is independent
of its non-descendants given its parents in $G$. The second
component of the pair, $\Theta$, contains a parameter
$\theta_{x_i \mid \Pi_{x_i}} = P(x_i \mid \Pi_{x_i})$ for each
possible value $x_i$ of $X_i$ and $\Pi_{x_i}$ of $\Pi_{X_i}$,
where $\Pi_{X_i}$ denotes the set of parents of $X_i$ in $G$.
$B$ defines a unique joint probability distribution over $U$
defined by:

$$P_B(X_1, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid \Pi_{X_i})$$
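
To make the factorization concrete, here is a small sketch that evaluates the joint probability of one complete assignment as a product of local conditional probabilities. The function name `joint_probability`, the dictionary layout of the tables, and the two-node network with its numbers are all hypothetical and chosen only for illustration.

```python
def joint_probability(assignment, parents, cpts):
    """Evaluate P_B(x_1, ..., x_n) as the product of P_B(x_i | parents of X_i).

    assignment: dict variable -> value
    parents:    dict variable -> tuple of parent variables
    cpts:       dict variable -> {(value, parent_values): probability}
    """
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpts[var][(value, parent_values)]
    return p


# Hypothetical two-node network C -> X with made-up probabilities.
parents = {"C": (), "X": ("C",)}
cpts = {
    "C": {(0, ()): 0.6, (1, ()): 0.4},
    "X": {(0, (0,)): 0.9, (1, (0,)): 0.1, (0, (1,)): 0.2, (1, (1,)): 0.8},
}
print(joint_probability({"C": 1, "X": 1}, parents, cpts))
```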

The problem of learning a Bayesian network can be stated as
follows. Given a training set $D = \{u_1, \ldots, u_N\}$ of
instances of $U$, find a network that best fits $D$.

We now review the Minimum Description Length (MDL) score
of a Bayesian Network. As mentioned before, our algorithm
learns optimal ABNs with respect to MDL score. The MDL score
casts learning in terms of data compression. The goal of the
learner is to find a structure that facilitates the shortest
description of the given data [5] [3]. Intuitively, data
having regularities can be described in a compressed form.
In the context of Bayesian network learning, we describe the
data using DAGs that represent dependencies between
attributes. A Bayesian network with the least MDL score
(highly compressed) is said to model the underlying
distribution in the best possible way. Thus the problem of
learning Bayesian networks using MDL score becomes an
optimization problem. The MDL score of a Bayesian network
$B$ is defined as

$$\mathrm{MDL}(B) = \frac{|B|}{2}\log N - N \sum_{i=1}^{n} I(X_i; \Pi_{X_i}) \tag{1}$$

where $N$ is the number of instances of the set of attributes,
$|B|$ is the number of parameters in the Bayesian network $B$,
$n$ is the number of attributes, and $I(X_i; \Pi_{X_i})$ is
the mutual information between an attribute $X_i$ and its
parents in the network. As per the definition of the ABN
structure, the class attribute does not have any parents.
Hence we have $I(C; \Pi_C) = 0$. Also, each attribute has as
its parents the class attribute and at most one other
attribute. Hence for the ABN structure, we have

$$\sum_{i=1}^{n} I(X_i; \Pi_{X_i}) = \sum_{i,\,|\pi(i)|=2} I(X_i; \Pi_{X_i}, C) + \sum_{i,\,|\pi(i)|=1} I(X_i; C) \tag{2}$$

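The first term on the right-hand side of equation (2)
represents all attributes with an incoming augmenting arc.
The second term represents attributes without an incoming
augmenting arc. Here $|\pi(i)|$ is the number of parents of
$X_i$, and in the first term $\Pi_{X_i}$ stands for the single
attribute parent of $X_i$, with $C$ written separately.
Consider the chain law for mutual information given below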

$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) \tag{3}$$

Applying the chain law to the first term on the right-hand
side of equation (2) we get

$$\sum_{i=1}^{n} I(X_i; \Pi_{X_i}) = \sum_{i,\,|\pi(i)|=2} I(X_i; \Pi_{X_i} \mid C) + \sum_{i=1}^{n} I(X_i; C) \tag{4}$$
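
For readers who want the intermediate step, the passage from equation (2) to equation (4) can be spelled out as follows. This is a sketch of the algebra under the reading that, in the sums restricted to $|\pi(i)|=2$, $\Pi_{X_i}$ stands for the single attribute parent of $X_i$; it applies the chain law (3) with $X = X_i$, $Y = \Pi_{X_i}$ and $Z = C$.

```latex
\begin{align*}
\sum_{i,\,|\pi(i)|=2} I(X_i;\Pi_{X_i},C)
  &= \sum_{i,\,|\pi(i)|=2} \bigl[\, I(X_i;C) + I(X_i;\Pi_{X_i}\mid C) \,\bigr], \\
\sum_{i=1}^{n} I(X_i;\Pi_{X_i})
  &= \sum_{i,\,|\pi(i)|=2} I(X_i;\Pi_{X_i}\mid C)
   + \sum_{i,\,|\pi(i)|=2} I(X_i;C)
   + \sum_{i,\,|\pi(i)|=1} I(X_i;C) \\
  &= \sum_{i,\,|\pi(i)|=2} I(X_i;\Pi_{X_i}\mid C)
   + \sum_{i=1}^{n} I(X_i;C).
\end{align*}
```

The last line merges the two sums over $I(X_i;C)$, since every attribute has either one or two parents.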



For any ABN structure, the second term of equation (4),
$\sum_{i=1}^{n} I(X_i; C)$, is a constant. This is because
the term represents the arcs from the class attribute to all
other attributes in the network, and these arcs are common to
all ABN structures (as per the definition). Using equations
(1) and (4), we rewrite the non-constant terms of the MDL
score for ABN structures as follows



$$\mathrm{MDL}(B_{Aug}) = \frac{|B_{Aug}|}{2}\log N - N \sum_{i,\,|\pi(i)|=2} I(X_i; \Pi_{X_i} \mid C) \tag{5}$$

where $B_{Aug}$ denotes an ABN structure.
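
The conditional mutual information terms in the score have to be estimated from the training data. A minimal sketch of the usual plug-in estimate from empirical counts is given below. The function name `conditional_mutual_information` is an assumption, and natural logarithms are used here, so the same base would have to be used for $\log N$ in the score.

```python
from collections import Counter
import math


def conditional_mutual_information(xs, ys, cs):
    """Empirical I(X; Y | C) in nats from three equal-length value lists."""
    n = len(xs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), k in n_xyc.items():
        # P(x, y | c) / (P(x | c) P(y | c)) = n_xyc * n_c / (n_xc * n_yc)
        ratio = k * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (k / n) * math.log(ratio)
    return total


# Y copies X inside every class, so I(X; Y | C) equals H(X | C) = log 2.
print(conditional_mutual_information([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]))
```

Only value combinations that actually occur in the data contribute to the sum, so no special handling of zero counts is needed.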



4. SOME INSIGHTS 

Looking at the MDL score given in equation (5), we present
a few insights on the problem of learning ABNs. The first
term of the MDL equation, $\frac{|B_{Aug}|}{2}\log N$,
represents the length of the ABN structure. Note that the
length of any ABN structure depends only on the number of
augmenting arcs, as the rest of the structure is the same for
all ABNs. If we annotate the augmenting arcs with the
conditional mutual information, given the class, between the
respective head and tail attributes, then the second term,
$N \sum_{i,\,|\pi(i)|=2} I(X_i; \Pi_{X_i} \mid C)$, represents
the sum of costs of all augmenting arcs, scaled by $N$. Since
the best MDL score is the minimum score, our problem can be
thought of as balancing the number of augmenting arcs against
the sum of costs of all augmenting arcs, where we wish to
maximize the total cost.

The MDL score for ABN structures is decomposable on at- 
tributes. We can rewrite equation (5) as 



$$\mathrm{MDL}(B_{Aug}) = \sum_{i=1}^{n} \left[ \frac{|X_i|}{2}\log N - N\, I(X_i; \Pi_{X_i} \mid C) \right] \tag{6}$$



where $|X_i|$ is the number of parameters stored at attribute
$X_i$ (the mutual information term is taken to be zero for an
attribute with no incoming augmenting arc). The number of
parameters stored at attribute $X_i$ depends on the number of
parents of $X_i$ in $B_{Aug}$, and hence on whether $X_i$ has
an incoming augmenting arc. Since we want to minimize the MDL
score of our network, we should add an augmenting arc to an
attribute $X_i$ only if its cost $I(X_j; X_i \mid C)$
dominates the increase in the number of parameters of $X_i$.
For example, consider an attribute $X_i$ with no augmenting
arc incident on it. Then the number of parameters stored at
the attribute $X_i$ in the ABN will be
$\|C\|(\|X_i\| - 1)$, where $\|C\|$ and $\|X_i\|$ are the
number of states of the attributes $C$ and $X_i$
respectively. Thus $|X_i| = \|C\|(\|X_i\| - 1)$. If now an
augmenting arc $e = (X_j, X_i)$ having a cost of
$cost(e) = I(X_i; X_j \mid C) = I(X_j; X_i \mid C)$ is made
incident on the attribute $X_i$, then the number of
parameters stored at $X_i$ will be
$|X_i| = \|X_j\| \cdot \|C\| \cdot (\|X_i\| - 1)$, where
$\|X_j\|$ is the number of states of the attribute $X_j$.
Note that the addition of the augmenting arc has increased
the number of parameters of the network. Since we want to
add an augmenting arc on $X_i$ only if it reduces the MDL
score, the following condition must be satisfied

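$$\frac{\|C\|(\|X_i\| - 1)}{2}\log N > \frac{\|X_j\| \cdot \|C\| \cdot (\|X_i\| - 1)}{2}\log N - N\,cost(e) \tag{7}$$

which is equivalent to

$$cost(e) > \frac{\|C\|(\|X_i\| - 1)(\|X_j\| - 1)}{2N}\log N = T_R \tag{8}$$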



Note that this equivalence implies that the overall change
in MDL score is independent of the arc direction. That is,
adding an augmenting arc $(X_i, X_j)$ changes the network
score identically to adding the arc $(X_j, X_i)$. Thus any
augmenting arc is eligible to be added to an ABN structure if
it has a cost of at least the defined threshold $T_R$ and if
it does not violate the ABN structure. Note that this
threshold depends only on the number of discrete states of
the attributes and the number of cases in the input database,
and is independent of the direction of the augmenting arc. We
now present a polynomial time greedy algorithm for learning
an optimal ABN with respect to MDL score.
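
As a small worked example of the threshold, the sketch below computes $T_R$ from the state counts and the number of instances. The function name `threshold` and the sample numbers are assumptions, and natural logarithms are used, which only makes sense if the arc costs are also measured in nats.

```python
import math


def threshold(n_states_c, n_states_i, n_states_j, n_instances):
    """T_R = ||C|| (||X_i|| - 1) (||X_j|| - 1) log(N) / (2N)."""
    extra_params = n_states_c * (n_states_i - 1) * (n_states_j - 1)
    return extra_params * math.log(n_instances) / (2 * n_instances)


# Hypothetical setting: binary class, attributes with 3 and 4 states, N = 1000.
t_forward = threshold(2, 3, 4, 1000)
t_reverse = threshold(2, 4, 3, 1000)
print(t_forward, t_reverse)
```

Swapping the two attributes leaves the value unchanged, which is the direction independence noted in the text. The threshold also shrinks as the number of instances grows, so larger data sets admit weaker dependencies as augmenting arcs.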

5. THE ALGORITHM 

1. Construct a complete undirected graph G = (V,E), 
such that V is the set of attributes (excluding the class 
attribute). 

2. For each edge $e = (X_i, X_j) \in G$, compute cost(e) =
$I(X_i; X_j \mid C)$. Annotate e with cost(e).

3. Remove from the graph G any edges that have a cost
less than the threshold $T_R$. This will possibly make
the graph G disconnected.

4. Run Kruskal's Maximum Spanning Tree algorithm
on each of the connected components of G. This will
make G a maximum cost forest (a collection of maximum
cost spanning trees).

5. For each tree in G, choose a root attribute and set 
directions of all edges to be outward from the root 
attribute. 

6. Add the class variable as a vertex C to the set V and 
add directed edges from C to all other vertices in G. 

7. Return G. 



In step 3, the threshold for an edge $e = (X_i, X_j)$ is
$T_R = \frac{\|C\|(\|X_i\| - 1)(\|X_j\| - 1)}{2N}\log N$.



The algorithm constructs an undirected graph G in which
all edges have costs at or above the defined threshold $T_R$.
As seen in the previous section, all edges having costs
greater than the threshold improve the overall score of the
ABN structure. Running the Maximum Spanning Tree algorithm
on each of the connected components of G ensures that the
ABN structure is preserved and at the same time maximizes
the second term of the MDL score given in equation (5).
Note that if the graph G remains connected after step 3 of
the algorithm, our algorithm outputs a TAN structure. In this
sense, our algorithm can be thought of as a generalization of
the TAN algorithm given in [2]. The next section proves that
the Augmented Bayes structure output by our algorithm is
optimal with respect to the MDL score.
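
The steps of Section 5 can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `learn_abn`, the argument layout, and the assumption that the costs $I(X_i; X_j \mid C)$ are precomputed (in nats) are all choices made for this example. Running Kruskal's algorithm once over the pruned graph yields the same forest as running it on each connected component separately.

```python
import math
from itertools import combinations


def learn_abn(attributes, n_states, n_states_class, n_instances, cost):
    """Return augmenting arcs (parent, child) following steps 1 to 5.

    attributes: list of attribute names (class excluded)
    n_states:   dict attribute -> number of states
    cost:       dict frozenset({a, b}) -> precomputed I(a; b | C)
    """
    def threshold(a, b):
        extra = n_states_class * (n_states[a] - 1) * (n_states[b] - 1)
        return extra * math.log(n_instances) / (2 * n_instances)

    # Steps 1-3: complete graph, then drop edges with cost below the threshold.
    edges = [(cost[frozenset((a, b))], a, b)
             for a, b in combinations(attributes, 2)
             if cost[frozenset((a, b))] >= threshold(a, b)]

    # Step 4: Kruskal, heaviest edge first, with union-find for cycle checks.
    root = {a: a for a in attributes}

    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]   # path halving
            a = root[a]
        return a

    neighbours = {a: [] for a in attributes}
    for _, a, b in sorted(edges, key=lambda e: e[0], reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:
            root[ra] = rb
            neighbours[a].append(b)
            neighbours[b].append(a)

    # Step 5: orient each tree away from an arbitrarily chosen root.
    arcs, seen = [], set()
    for start in attributes:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in neighbours[u]:
                if v not in seen:
                    seen.add(v)
                    arcs.append((u, v))
                    stack.append(v)
    return arcs  # Step 6, the arcs C -> X_i, is left implicit.


# Hypothetical costs for three binary attributes and a binary class.
names = ["A", "B", "D"]
states = {"A": 2, "B": 2, "D": 2}
costs = {frozenset(("A", "B")): 0.5,
         frozenset(("A", "D")): 0.3,
         frozenset(("B", "D")): 0.4}
print(learn_abn(names, states, 2, 1000, costs))
```

Choosing the first unvisited attribute as the root of each tree is arbitrary; by the direction independence noted in Section 4, any root gives the same score.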

6. PROOF 

We prove that the ABN output by our algorithm is optimal
by making the observation that no optimal ABN can contain
any edge that was removed in step 3 of the algorithm. This
is because removing any such edge lowers the MDL score
and leaves the structure an ABN. Consequently, an optimal
ABN can contain only those edges that remain after step 3
of the algorithm. If an optimal ABN does not connect some
connected component of the graph G that results following
step 3, edges with costs greater than or equal to $T_R$ can
be added without increasing overall MDL score until the
component is spanned. Hence there exists an optimal ABN
that spans each component of the graph G that results from
step 3. By the correctness of Kruskal's algorithm run on
each connected component to find a maximum cost spanning
tree, an optimal ABN is found. Thus the ABN output by
our algorithm is an optimal ABN.